(57) Abstract

The system and method are used for recognising a time-sequential input pattern (20), which is derived from a continual physical 
quantity, such as speech. The system comprises input means (30), which accesses the physical quantity and therefrom generates a plurality 
of input observation vectors. The input observation vectors represent the input pattern. A reference pattern database (40) is used for storing 
a plurality of reference patterns. Each reference pattern consists of a sequence of reference units, where each reference unit is represented 
by at least one associated reference vector $\vec{\mu}_a$ in a set $\{\vec{\mu}_a\}$ of reference vectors. A localizer (50) is used for locating among the reference
patterns stored in the reference pattern database (40) a recognised reference pattern, which corresponds to the input pattern. 




WO 97/08685 PCT/IB96/00849 


Method and system for pattern recognition based on dynamically constructing a subset of 
reference vectors 



The invention relates to a method for recognising an input pattern which 
is derived from a continual physical quantity; said method comprising: 

accessing said physical quantity and therefrom generating a plurality of input 
observation vectors, representing said input pattern; 
locating among a plurality of reference patterns a recognised reference pattern,
which corresponds to said input pattern; at least one reference pattern being a sequence of
reference units; each reference unit being represented by at least one associated reference
vector $\vec{\mu}_a$ in a set $\{\vec{\mu}_a\}$ of reference vectors; said locating comprising selecting for each input
observation vector $\vec{o}$ a subset $\{\vec{\mu}_s\}$ of reference vectors from said set $\{\vec{\mu}_a\}$ and calculating
vector similarity scores between said input observation vector $\vec{o}$ and each reference vector $\vec{\mu}_s$
of said subset $\{\vec{\mu}_s\}$.

The invention also relates to a system for recognising a time-sequential 
input pattern, which is derived from a continual physical quantity; said system comprising: 
input means for accessing said physical quantity and therefrom generating a 
plurality of input observation vectors, representing said input pattern;

a reference pattern database for storing a plurality of reference patterns; at least
one reference pattern being a sequence of reference units; each reference unit being
represented by at least one associated reference vector $\vec{\mu}_a$ in a set $\{\vec{\mu}_a\}$ of reference vectors;

a localizer for locating among the reference patterns stored in said reference
pattern database a recognised reference pattern, which corresponds to said input pattern; said
locating comprising selecting for each input observation vector $\vec{o}$ a subset $\{\vec{\mu}_s\}$ of reference
vectors from said set $\{\vec{\mu}_a\}$ and calculating vector similarity scores between said input
observation vector $\vec{o}$ and each reference vector $\vec{\mu}_s$ of said subset $\{\vec{\mu}_s\}$; and

output means for outputting said recognised pattern.



Recognition of a time-sequential input pattern, which is derived from a
continual physical quantity, such as speech or images, is becoming increasingly important.
In particular, speech recognition has recently been widely applied to areas such as telephone
and telecommunications (various automated services), office and business systems (data
entry), manufacturing (hands-free monitoring of manufacturing processes), medicine
(annotating of reports), games (voice input), voice control of car functions and voice control
used by disabled people. For continuous speech recognition, the following signal processing
steps are commonly used, as illustrated in figure 1 [refer to L. Rabiner, "A Tutorial on Hidden
Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE,
Vol. 77, No. 2, February 1989]:

Feature analysis: the speech input signal is spectrally and/or temporally
analyzed to calculate a representative vector of features (observation vector $\vec{o}$).
Typically, the speech signal is digitised (e.g. sampled at a rate of 6.67 kHz)
and pre-processed, for instance by applying pre-emphasis. Consecutive samples
are grouped (blocked) into frames, corresponding to, for instance, 32 msec of
speech signal. Successive frames partially overlap, for instance by 16 msec. Often
the Linear Predictive Coding (LPC) spectral analysis method is used to calculate
for each frame a representative vector of features (observation vector $\vec{o}$). The
feature vector may, for instance, have 24, 32 or 63 components (the feature
space dimension).

Unit matching system: the observation vectors are matched against an inventory
of speech recognition units. Various forms of speech recognition units may be
used. Some systems use linguistically based sub-word units, such as phones,
diphones or syllables, as well as derivative units, such as fenemes and fenones.
Other systems use a whole word or a group of words as a unit. The so-called
hidden Markov model (HMM) is widely used to stochastically model speech
signals. Using this model, each unit is typically characterised by an HMM,
whose parameters are estimated from a training set of speech data. For large
vocabulary speech recognition systems involving, for instance, 10,000 to 60,000
words, usually a limited set of, for instance, 40 sub-word units is used, since it
would require a lot of training data to adequately train an HMM for larger
units. The unit matching system matches the observation vectors against all
sequences of speech recognition units and provides the likelihoods of a match
between the vector and a sequence. Constraints can be placed on the matching,
for instance by:

Lexical decoding: if sub-word units are used, a pronunciation lexicon
describes how words are constructed of sub-word units. The possible
sequence of sub-word units, investigated by the unit matching system, is
then constrained to sequences in the lexicon.

Syntactical analysis: further constraints are placed on the unit matching
system so that the paths investigated are those corresponding to speech
units which comprise words (lexical decoding) and for which the words
are in a proper sequence as specified by a word grammar.
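
The frame blocking described under feature analysis above can be sketched as follows. This is only an illustration under stated assumptions: the function name block_into_frames, its parameter names and the 16 kHz sampling rate are chosen for the example, while the 32 msec frame length and 16 msec overlap are the example values given in the text.

```python
def block_into_frames(samples, sample_rate_hz=16000, frame_ms=32, shift_ms=16):
    """Group consecutive samples into partially overlapping frames.

    With a 32 msec frame and a 16 msec shift, successive frames overlap
    by 16 msec. All names and defaults here are illustrative assumptions.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    shift = sample_rate_hz * shift_ms // 1000      # samples between frame starts
    frames = []
    start = 0
    # Only complete frames are kept; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


signal = list(range(1600))  # 100 msec of dummy samples at 16 kHz
frames = block_into_frames(signal)
```

Each frame would then be passed to a spectral analysis step (for instance LPC) to obtain one observation vector per frame; that step is not shown here.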



A discrete Markov process describes a system which at any time is in one
of a set of N distinct states. At regular times the system changes state according to a set of
probabilities associated with the state. A special form of a discrete Markov process is shown
in figure 2. In this so-called left-right model, the states proceed from left to right (or stay the
same). This model is widely used for modelling speech, where the properties of the signal
change over time. The model states can be seen as representing sounds. The number of states
in a model for a sub-word unit could, for instance, be five or six, in which case a state, on
average, corresponds to an observation interval. The model of figure 2 allows a state to stay
the same, which can be associated with slow speaking. Alternatively, a state can be skipped,
which can be associated with speaking fast (in figure 2 up to twice the average rate). The
output of the discrete Markov process is the set of states at each instant of time, where each
state corresponds to an observable event. For speech recognition systems, the concept of
discrete Markov processes is extended to the case where an observation is a probabilistic
function of the state. This results in a doubly stochastic process. The underlying stochastic
process of state changes is hidden (the hidden Markov model, HMM) and can only be
observed through a stochastic process that produces the sequence of observations.

For speech, the observations represent continuous signals. The
observations can be quantised to discrete symbols chosen from a finite alphabet of, for
instance, 32 to 256 vectors. In such a case a discrete probability density can be used for each
state of the model. In order to avoid degradation associated with quantising, many speech
recognition systems use continuous observation densities. Generally, the densities are derived
from log-concave or elliptically symmetric densities, such as Gaussian (normal distribution)
or Laplacian densities. During training, the training data (training observation sequences) is
segmented into states using an initial model. This gives for each state a set of observations.
Next, the observation vectors for each state are clustered. Depending on the complexity of
the system and the amount of training data, there may, for instance, be between 32 and 120
clusters for each state. Each cluster has its own density, such as a Gaussian density. The
density is represented by a reference vector, such as a mean vector. The resulting
observation density for the state is then a weighted sum of the cluster densities.

To recognise a single speech recognition unit (e.g. word or sub-word unit)
from a speech signal (observation sequence), for each speech recognition unit the likelihood
is calculated that it produced the observation sequence. The speech recognition unit with
maximum likelihood is selected. To recognise larger sequences of observations, a levelled
approach is used. Starting at the first level, likelihoods are calculated as before. Whenever
the last state of a model is reached a switch is made to a higher level, repeating the same
process for the remaining observations. When the last observation has been processed, the
path with the maximum likelihood is selected and the path is backtraced to determine the
sequence of involved speech recognition units.

The likelihood calculation involves calculating in each state the distance of
the observation (feature vector) to each reference vector, which represents a cluster.
Particularly in large vocabulary speech recognition systems using continuous observation
density HMMs, with, for instance, 40 sub-word units, 5 states per sub-word unit and 64
clusters per state, this implies 12800 distance calculations between, for instance, 32
dimensional vectors. These calculations are repeated for each observation. Consequently, the
likelihood calculation may consume 50%-75% of the computing resources. It is known from
E. Bocchieri, "Vector quantization for the efficient computation of continuous density
likelihoods", Proceedings of ICASSP, 1993, pp. 692-695, to select for each observation vector
$\vec{o}$ a subset of densities (and corresponding reference vectors) and calculate the likelihood of
the observation vector for the subset. The likelihoods of the densities which are not part of
the selected subset are approximated. According to the known method, during training all
densities are clustered into neighbourhoods. A vector quantiser, consisting of one codeword
for each neighbourhood, is also defined. For each codeword a subset of densities, which are
near the codeword, is defined. This definition of subsets is done in advance, for instance
during the training of the system. During recognition, for each observation vector a subset is
selected from the predefined subsets by quantising the observation vector to one of the
codewords and using the subset defined for the codeword as the subset of densities for which
the likelihood of the observation is calculated. The disadvantage of this approach is that the
subsets are statically defined based on the given reference vectors. Particularly for an
observation vector which is near boundaries of the predetermined subsets, the selected subset
may actually contain many reference vectors which are further from the observation vector
than reference vectors in neighbouring subsets. Therefore, to achieve a low pattern error
rate, the selected subset needs to be relatively large.

It is an object of the invention to provide a method and system of the kind
set forth for selecting for each observation vector a subset of reference vectors, which is also
based on the observation vector. It is a further object to provide a method and system which
gives the potential to recognise patterns with a lower pattern error rate. It is a further object
to provide a method and system which gives the potential to reduce the percentage of
processing time required for the maximum likelihood calculation, without significantly
increasing the pattern error rate.

To achieve this object, the method according to the invention is
characterised in that selecting a subset $\{\vec{\mu}_s\}$ of reference vectors for each input observation
vector $\vec{o}$ comprises calculating a measure of dissimilarity between said input observation
vector $\vec{o}$ and each reference vector of said set $\{\vec{\mu}_a\}$ and using as said subset $\{\vec{\mu}_s\}$ of
reference vectors a number of reference vectors $\vec{\mu}_a$ whose measures of dissimilarity with
said input observation vector $\vec{o}$ are the smallest. By constructing the subset of reference
vectors dynamically for each observation vector based on the dissimilarity measure, a subset
is selected which with a high likelihood comprises reference vectors which are near the
observation vector. This opens the way for more accurate pattern recognition. Furthermore,
vectors which are not near the observation vector can, with a high likelihood, be excluded
from the subset. This opens the way to faster pattern recognition.
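
The dynamic selection described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names l_r_distance and select_subset are hypothetical, and the dissimilarity used here is the full L_r distance (raised to the power r) purely for clarity, whereas the versions described in the text use a cheaper estimate based on quantised vectors.

```python
import heapq


def l_r_distance(x, y, r=2):
    """Dissimilarity assumed for this sketch: sum_i |x_i - y_i|**r."""
    return sum(abs(a - b) ** r for a, b in zip(x, y))


def select_subset(observation, reference_vectors, subset_size,
                  dissimilarity=l_r_distance):
    """Return the indices of the subset_size reference vectors whose
    dissimilarity with the observation vector is smallest."""
    scored = ((dissimilarity(observation, ref), idx)
              for idx, ref in enumerate(reference_vectors))
    return [idx for _, idx in heapq.nsmallest(subset_size, scored)]


refs = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
subset = select_subset((1.0, 1.0), refs, subset_size=2)
```

Because the subset is rebuilt for every observation vector, an observation lying between two clusters of reference vectors still receives its nearest candidates from both sides.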

It should be noted that it is known from EP-A-627-726 to select a subset
of reference vectors by organising the reference vectors, using a tree structure, and
performing a tree search. At the highest level of the tree, the root node represents all
reference vectors. At one level lower in the tree, a plurality of intermediate nodes each
represent a disjoint subset of reference vectors, where the subsets together form the entire
set of reference vectors. This is repeated for successive lower levels, until at the lowest level
each of the leaf nodes of the tree represents an actual reference vector. During the pattern
recognition, for each input observation vector a tree search is performed starting at one level
below the root. For each node at this level, the likelihood is calculated that the observation
vector was produced by the subset represented by the node. To this end, each subset is
represented by a subset vector to which the observation vector is compared. One or more
nodes with maximum likelihood are selected. For these nodes, each representing a different
subset of reference vectors, the same process is repeated one level lower. In this manner,
finally a number of leaf nodes are selected. The reference vectors represented by the selected
leaf nodes together form the finally selected subset of reference vectors. Using this method,
the selected subset is based on the actual observation vector. Since the reference vectors are
pre-arranged in a tree structure and are selected using the tree search, individual reference
vectors are not compared to the observation vector during the selection of the subset. This
adversely affects the likelihood that the nearest reference vector is a member of the selected
subset.

In a first version according to the invention, the method is characterised:

in that said method comprises quantising each reference vector $\vec{\mu}_a$ to a quantised
reference vector $\vec{R}(\vec{\mu}_a)$, and

in that selecting the subset of reference vectors comprises for each input
observation vector $\vec{o}$ the steps of:

quantising said input observation vector $\vec{o}$ to a quantised observation
vector $\vec{R}(\vec{o})$;

calculating for said quantised observation vector $\vec{R}(\vec{o})$ distances
$d(\vec{R}(\vec{o}), \vec{R}(\vec{\mu}_a))$ to each quantised reference vector $\vec{R}(\vec{\mu}_a)$; and

using said distance $d(\vec{R}(\vec{o}), \vec{R}(\vec{\mu}_a))$ as said measure of dissimilarity
between said input observation vector $\vec{o}$ and said reference vector $\vec{\mu}_a$.

By quantising the vectors, the complexity of the vectors is reduced, making it possible to
efficiently calculate the distance between the quantised observation vector and the quantised
reference vectors. This distance between the quantised vectors, which can be seen as an
estimate of the distance between the actual vectors, is used to select the subset.

In a further version according to the invention, the method is
characterised in that said quantising a vector $\vec{x}$ (reference vector $\vec{\mu}_a$ or observation vector $\vec{o}$)
to a quantised vector $\vec{R}(\vec{x})$ comprises calculating a sign vector $\vec{S}(\vec{x})$ by assigning to each
component of said sign vector a binary value, with a first binary value b1 being assigned if
the corresponding component of the vector $\vec{x}$ has a negative value and a second binary value
b2 being assigned if the corresponding component of the vector $\vec{x}$ has a positive value.
Using binary values allows for very efficient calculation and storage using a
micro-processor, whereas the sign vector provides a reasonable approximation of the vector.
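
A sign vector of this kind can be sketched as below. The function name sign_vector and the default values b1 = 1 and b2 = 0 are assumptions for illustration; the text does not say how a component that is exactly zero should be treated, so this sketch arbitrarily assigns it b2.

```python
def sign_vector(x, b1=1, b2=0):
    """Assign b1 to each negative component and b2 to every other component.

    Treating a zero component like a positive one is an assumption of
    this sketch; the text only covers negative and positive values.
    """
    return [b1 if component < 0 else b2 for component in x]


s = sign_vector([0.7, -1.3, 2.0, -0.1])
```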

In a further version according to the invention, the method is
characterised in that calculating said distance $d(\vec{R}(\vec{o}), \vec{R}(\vec{\mu}_a))$ comprises calculating a
Hamming distance $H(\vec{S}(\vec{o}), \vec{S}(\vec{\mu}_a))$ of the vectors $\vec{S}(\vec{o})$ and $\vec{S}(\vec{\mu}_a)$. The Hamming distance of
binary vectors can be calculated very efficiently.

In a further version according to the invention, the method is
characterised in that said quantising further comprises calculating an $L_r$-norm of the vector $\vec{x}$
and multiplying said norm with said sign vector $\vec{S}(\vec{x})$. This provides a good approximation of
the vector. Since the $L_r$-norm of the reference vectors can be calculated in advance and the
$L_r$-norm of the observation vector $\vec{o}$ only needs to be calculated once for each observation
vector, the additional calculations do not seriously affect the on-line performance.

In a further version according to the invention, the method is
characterised in that said quantising further comprises dividing said sign vector $\vec{S}(\vec{x})$ by the
dimension of the vector $\vec{x}$ to the power 1/r. This provides an even better approximation of
the vector.
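
The two refinements above (multiplying the sign vector by the L_r-norm and dividing by the dimension to the power 1/r) can be combined in a short sketch. It assumes sign values of -1 and +1 so that the multiplication is meaningful; the names l_r_norm and quantise are hypothetical and not taken from the text.

```python
def l_r_norm(x, r):
    """L_r-norm: (sum_i |x_i|**r) ** (1/r)."""
    return sum(abs(c) ** r for c in x) ** (1.0 / r)


def quantise(x, r=2):
    """Quantise x to (L_r-norm of x) * S(x) / D**(1/r).

    S(x) is taken here as a -1/+1 sign vector (an assumption of this
    sketch) and D is the dimension of x.
    """
    dim = len(x)
    scale = l_r_norm(x, r) / dim ** (1.0 / r)
    return [-scale if c < 0 else scale for c in x]


x = [3.0, -4.0]
rx = quantise(x, r=2)
```

With this scaling the quantised vector has the same L_r-norm as the original vector, which is one way to see why dividing by the dimension to the power 1/r improves the approximation.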

In a further version according to the invention, the method is
characterised in that calculating said distances $d(\vec{R}(\vec{o}), \vec{R}(\vec{\mu}_a))$ comprises:

calculating the $L_r$-norm $\|\vec{\mu}_a\|_r$ of each vector $\vec{\mu}_a$, and

for each input observation vector $\vec{o}$:

calculating the $L_r$-norm $\|\vec{o}\|_r$ of the vector $\vec{o}$; and

calculating a Hamming distance $H(\vec{S}(\vec{o}), \vec{S}(\vec{\mu}_a))$ of the vector $\vec{S}(\vec{o})$ to
each vector $\vec{S}(\vec{\mu}_a)$.

Since the $L_r$-norm $\|\vec{\mu}_a\|_r$ of the reference vectors $\vec{\mu}_a$ can be calculated in advance and the
$L_r$-norm $\|\vec{o}\|_r$ of the vector $\vec{o}$ only needs to be calculated once for each observation vector, the
distance calculation has been reduced to primarily calculating the Hamming distance
$H(\vec{S}(\vec{o}), \vec{S}(\vec{\mu}_a))$, which can be calculated very efficiently.

In a further version according to the invention, the method is
characterised in that calculating said Hamming distance $H(\vec{S}(\vec{o}), \vec{S}(\vec{\mu}_a))$ of the vectors
$\vec{S}(\vec{o})$ and $\vec{S}(\vec{\mu}_a)$ comprises:

calculating a difference vector by assigning to each component of said
difference vector the binary XOR value of the corresponding components of $\vec{S}(\vec{o})$ and $\vec{S}(\vec{\mu}_a)$;

determining a difference number by calculating how many components in said
difference vector have the value one, and

using said difference number as the Hamming distance.

The binary XOR value can be calculated very efficiently for many components in one
operation.
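
A possible sketch of this XOR-based Hamming distance, with the sign vector packed into the bits of one integer so that a single XOR covers all components at once. The helper names pack_signs and hamming_distance and the bit layout (bit i set when component i is negative) are assumptions made for illustration.

```python
def pack_signs(x):
    """Pack the sign vector of x into an int: bit i is 1 if x[i] < 0."""
    bits = 0
    for i, component in enumerate(x):
        if component < 0:
            bits |= 1 << i
    return bits


def hamming_distance(s_a, s_b):
    """Hamming distance of two packed sign vectors."""
    difference_vector = s_a ^ s_b  # XOR of all components in one operation
    # Difference number: how many components of the difference vector are one.
    return bin(difference_vector).count("1")


o = [0.5, -1.0, 2.0, -3.0]
mu = [-0.2, -0.7, 1.5, 4.0]
h = hamming_distance(pack_signs(o), pack_signs(mu))
```

Here the two vectors differ in sign at components 0 and 3, so the difference vector has two one-bits.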

In a further version according to the invention, the method is
characterised:

in that said method comprises constructing a table specifying for each
N-dimensional vector, with components having a binary value of zero or one, a corresponding
number, indicating how many components have the value one; and

in that determining said difference number comprises locating said difference
vector in said table and using said corresponding number as the Hamming distance.

By counting in advance the number of one elements in a vector and storing this in a table,
the performance is increased further.
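
The table lookup can be sketched as follows. The table width N = 8 and the idea of splitting a longer difference vector into N-bit chunks are assumptions of this sketch (the text only specifies a table over N-dimensional binary vectors); the names ONES_TABLE and hamming_distance_table are hypothetical.

```python
N = 8  # assumed table width: one entry per N-dimensional binary vector

# Entry v holds the number of components of v that have the value one.
ONES_TABLE = [bin(v).count("1") for v in range(1 << N)]


def hamming_distance_table(s_a, s_b, dim):
    """Hamming distance of two packed binary vectors of dimension dim,
    looked up chunk by chunk in the precomputed table."""
    difference_vector = s_a ^ s_b
    mask = (1 << N) - 1
    total = 0
    for _ in range((dim + N - 1) // N):  # number of N-bit chunks
        total += ONES_TABLE[difference_vector & mask]
        difference_vector >>= N
    return total
```

For example, hamming_distance_table(0b1010, 0b0011, 4) looks up the single chunk 0b1001 and returns its stored count.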

In a further version according to the invention, the method is
characterised in that said method is adapted to, after having selected a subset of reference
vectors for a predetermined input observation vector $\vec{o}$, use the same subset for a number of
subsequent observation vectors. By using the same subset for a number of successive
observation vectors, the performance is improved further.

In a further version according to the invention, the method is
characterised in that said method comprises:

after selecting said subset of reference vectors for an input observation vector
$\vec{o}$, ensuring that each reference unit is represented by at least one reference vector in said
subset, by adding for each reference unit, which is not represented, a representative
reference vector to said subset. The accuracy of the recognition is improved if each reference
unit is represented in the subset.
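
Completing the subset in this way can be sketched as below. The data layout is an assumption for illustration: reference vectors are identified by index, unit_of maps each index to its reference unit, and representative names one reference vector per unit; none of these names come from the text.

```python
def complete_subset(subset, unit_of, representative):
    """Add a representative reference vector for every reference unit
    that is not yet represented in the selected subset.

    subset:         indices of the selected reference vectors
    unit_of:        unit_of[i] is the reference unit of reference vector i
    representative: maps each reference unit to one of its vector indices
    """
    represented = {unit_of[i] for i in subset}
    completed = list(subset)
    for unit, rep_index in representative.items():
        if unit not in represented:
            completed.append(rep_index)
    return completed


unit_of = ["a", "a", "b", "b", "c"]
representative = {"a": 0, "b": 2, "c": 4}
result = complete_subset([1, 3], unit_of, representative)
```

In this example units "a" and "b" are already represented by vectors 1 and 3, so only the representative of unit "c" is added.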

In a further version according to the invention, the method is
characterised in that said method comprises:

after ensuring that each reference unit is represented by at least one reference
vector in said subset, choosing for each reference unit said representative reference vector by
selecting as the representative reference vector the reference vector from the subset which
represents said reference unit and has the smallest distance to said input observation vector $\vec{o}$.
Since the observation vectors tend to change gradually, a reference vector which was found
to be the best representation of a reference unit for a specific observation vector is a good
candidate for supplementing the subset for a subsequent observation vector.

To achieve the object of the invention, the system according to the
invention is characterised in that selecting a subset $\{\vec{\mu}_s\}$ of reference vectors for each input
observation vector $\vec{o}$ comprises calculating a measure of dissimilarity between said input
observation vector $\vec{o}$ and each reference vector of said set $\{\vec{\mu}_a\}$ and using as said subset $\{\vec{\mu}_s\}$
of reference vectors a number of reference vectors $\vec{\mu}_a$, whose measures of dissimilarity with
said input observation vector $\vec{o}$ are the smallest.

A first embodiment of a system according to the invention is characterised
in that said reference pattern database further stores for each reference vector $\vec{\mu}_a$ a quantised
reference vector $\vec{R}(\vec{\mu}_a)$; and

in that selecting the subset $\{\vec{\mu}_s\}$ of reference vectors comprises for each input
observation vector $\vec{o}$ the steps of:

quantising said input observation vector $\vec{o}$ to a quantised observation
vector $\vec{R}(\vec{o})$;

calculating for said quantised observation vector $\vec{R}(\vec{o})$ distances
$d(\vec{R}(\vec{o}), \vec{R}(\vec{\mu}_a))$ to each quantised reference vector $\vec{R}(\vec{\mu}_a)$; and

using said distance $d(\vec{R}(\vec{o}), \vec{R}(\vec{\mu}_a))$ as said measure of dissimilarity
between said input observation vector $\vec{o}$ and said reference vector $\vec{\mu}_a$.

A further embodiment of a system according to the invention is
characterised in that for each reference vector $\vec{\mu}_a$ the reference pattern database comprises
the $L_r$-norm $\|\vec{\mu}_a\|_r$ of the reference vector $\vec{\mu}_a$; and

in that said localizer calculates said distances $d(\vec{R}(\vec{o}), \vec{R}(\vec{\mu}_a))$ by, for each input
observation vector $\vec{o}$:

calculating the $L_r$-norm $\|\vec{o}\|_r$ of the vector $\vec{o}$ and a Hamming distance
$H(\vec{S}(\vec{o}), \vec{S}(\vec{\mu}_a))$ of the vector $\vec{S}(\vec{o})$ to each vector $\vec{S}(\vec{\mu}_a)$, and

combining the $L_r$-norm $\|\vec{o}\|_r$ and the Hamming distance $H(\vec{S}(\vec{o}), \vec{S}(\vec{\mu}_a))$
with the $L_r$-norm $\|\vec{\mu}_a\|_r$ stored in said reference pattern database.

A further embodiment of a system according to the invention is
characterised in that said system comprises a memory for storing a table specifying for each
N-dimensional vector, with components having a binary value of zero or one, a
corresponding number, indicating how many components have the value one; and

in that determining said difference number comprises locating said difference
vector in said table and using said corresponding number as the Hamming distance.

These and other aspects of the invention will be apparent from and
elucidated with reference to the embodiments shown in the drawings.

Figure 1 illustrates the processing steps which are commonly used for
continuous speech recognition,

Figure 2 shows an example of a left-right discrete Markov process,

Figure 3 shows a block diagram of an embodiment of a system according
to the present invention,

Figure 4 shows a 2-dimensional representation of a first vector
quantisation,

Figure 5 illustrates results obtained with the first vector quantisation,

Figure 6 shows a 2-dimensional representation of a second vector
quantisation, and

Figure 7 illustrates results obtained with the second vector quantisation.

Figure 3 shows a block diagram of a system 10 according to the
invention, for recognising a time-sequential input pattern 20 which is derived from a
continual physical quantity, such as speech or images. Input means 30 recurrently accesses
the physical quantity. For speech, this usually involves sampling the physical quantity at
a regular rate, such as 6.67 kHz or 16 kHz, and digitising the sample. The input means
30 processes a group of consecutive samples, corresponding to, for instance, 32 msec of
speech signal, to provide a representative vector of features (the input observation vector $\vec{o}$).
In this way a time sequence of input observation vectors is generated, which represents the
input pattern. Typically, the input means 30 may be implemented using a microphone, an
A/D converter and a processor, such as a Digital Signal Processor (DSP). Optionally, the
input means 30 may comprise a speech detector for effecting the sampling only when speech
is effectively received. As an alternative to sampling and digitising the input signal, the
signal may have been stored in memory in a digitised form or may be supplied digitally via a
communication network. A reference pattern database 40 is used for storing reference
patterns. As described earlier, speech recognition units are used as reference patterns for
recognising speech. Each reference pattern comprises a sequence of reference units. Each
reference unit is represented by at least one associated reference vector $\vec{\mu}_a$. All reference
vectors together form a set $\{\vec{\mu}_a\}$ of reference vectors. Using pattern recognition based on
Hidden Markov Models, each reference pattern is modelled by a Hidden Markov Model,
where each state of the model corresponds to a reference unit. Using continuous observation
densities, such as Gaussian or Laplacian densities, the reference vectors correspond to the
mean vectors of the densities. The reference pattern database 40 may be stored in memory, such as a
hard disk, ROM or RAM, as an integrated database or, alternatively, as separate data files.

The system 10 further comprises a localizer 50 for locating in the
reference pattern database 40 a reference pattern which corresponds to the input pattern. The
localizer 50 may be implemented using a DSP or micro-processor. The located reference
pattern is referred to as the recognised reference pattern. As described earlier this involves
calculating a likelihood of the observation vector. For each Hidden Markov Model and each
state s of the model, the likelihood of an observation vector $\vec{o}$ is given by:

$$p(\vec{o}) = \sum_{k=1}^{N} w(k) \cdot p(\vec{o} \mid k)$$

where w(k) is the weight of the k-th observation mixture density (cluster) and N is the
number of clusters for a state. For simplicity, the state s is not shown in the formula. Speech
recognition systems usually use Laplacian or Gaussian probability densities to model the
probability distribution of a cluster. Using the $L_r$-norm, defined as:

$$d_r(\vec{x}, \vec{y}) = \|\vec{x} - \vec{y}\|_r = \left( \sum_{i=1}^{D} |x_i - y_i|^r \right)^{1/r}$$

where the $L_1$-norm is used for Laplacian densities and the $L_2$-norm is used for Gaussian
densities, gives as one of the possible formulas for the probability:

$$p(\vec{o}) = \sum_{k=1}^{N} w(k) \cdot a \cdot e^{-b \|\vec{o} - \vec{\mu}(k)\|_r^r}$$

where the reference vector $\vec{\mu}(k)$ is the mean vector of the k-th observation mixture density.
The coefficients a and b ensure that the probability integrates to 1 if the observation
vector $\vec{o}$ is run over all possible values. Various forms or extensions of this formula are well
known. As an example, the following three Gaussian densities are given:

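Full covariance matrix $K_{s,k}$:

$$p(\vec{o} \mid k) = \frac{1}{\sqrt{(2\pi)^D \det K_{s,k}}} \, e^{-\frac{1}{2} (\vec{o} - \vec{\mu}(k))^T K_{s,k}^{-1} (\vec{o} - \vec{\mu}(k))}$$

Diagonal covariance matrix $(K_{s,k})_{dd} = \sigma_{s,k,d}^2$:

$$p(\vec{o} \mid k) = \frac{1}{\sqrt{(2\pi)^D \prod_{d=1}^{D} \sigma_{s,k,d}^2}} \, e^{-\frac{1}{2} \sum_{d=1}^{D} \frac{(o_d - \mu_d(k))^2}{\sigma_{s,k,d}^2}}$$

Scalar variance $K_{s,k} = I \cdot \sigma_{s,k}^2$:

$$p(\vec{o} \mid k) = \frac{1}{\sqrt{(2\pi)^D \sigma_{s,k}^{2D}}} \, e^{-\frac{\|\vec{o} - \vec{\mu}(k)\|_2^2}{2 \sigma_{s,k}^2}}$$
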
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It will be appreciated that also other distance measures than the L r -norm may be used- 
Ad vantageously, the observation vector "o and the mean vectors /T(k) are 
scaled before the likelihood calculation takes place. Scaling can be used to prevent that terms 
fall below the precision range of the processor and to normalise the vectors according to the 
variance of the density. The scaling may be performed by multiplying the vectors with a 
diagonal DxD matrix V, where D is the dimension of the feature vector space. 

$$V = \begin{pmatrix} V_1 & & 0 \\ & \ddots & \\ 0 & & V_D \end{pmatrix}$$

The matrix elements V_1 to V_D are estimated during training. Starting with the complete set of reference vectors, a common variance is calculated for each component separately, resulting in one pooled variance vector. For all reference vectors, the vector components are divided by the corresponding pooled standard deviation and re-scaled with a value γ to bring the component into the required range. The value γ is the same for each vector component. Other forms of scaling are well known. Advantageously, the reference vectors are scaled in advance and the observation vector is only scaled once before starting the actual likelihood calculations.
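As a purely illustrative sketch of the scaling described above, the diagonal elements can be computed as γ divided by a per-component standard deviation. The function names are hypothetical, and the variance is taken here around the global mean of all reference vectors, which is only one possible reading of "pooled variance".

```python
import math

def pooled_scale_factors(reference_vectors, gamma=1.0):
    """Return one factor per component: gamma / standard deviation of that component.

    Assumption: the variance is computed over all reference vectors around
    their global mean; the text does not spell out the pooling in detail.
    """
    count = len(reference_vectors)
    dim = len(reference_vectors[0])
    factors = []
    for d in range(dim):
        column = [vector[d] for vector in reference_vectors]
        mean = sum(column) / count
        variance = sum((value - mean) ** 2 for value in column) / count
        std = math.sqrt(variance)
        # Guard against a constant component, which has zero deviation.
        factors.append(gamma / std if std > 0 else gamma)
    return factors

def scale(vector, factors):
    """Multiply a vector by the diagonal matrix whose diagonal is `factors`."""
    return [value * factor for value, factor in zip(vector, factors)]

references = [[1.0, 10.0], [3.0, 30.0]]
factors = pooled_scale_factors(references)
scaled_observation = scale([2.0, 20.0], factors)
```

The reference vectors would be scaled once in advance with the same factors, so that only the observation vector has to be scaled at recognition time.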

Due to the nature of the densities, the sum of probabilities can be approximated by the maximum, i.e. the density which contributes the largest probability. This implies that a key step in locating a reference pattern which corresponds to the input pattern is finding the reference vector which is nearest to the observation vector (nearest neighbour search):

$$p(\vec{o}) \approx \max\{\, w(k) \cdot a \cdot e^{-b\,\|\vec{o}-\vec{\mu}(k)\|_r^r} \mid k = 1,\ldots,N \,\}$$

By taking the logarithm, this gives:

$$\log(p(\vec{o})) \approx -\min\{\, b\,\|\vec{o}-\vec{\mu}(k)\|_r^r - \log(w(k)) \mid k = 1,\ldots,N \,\} + \log(a)$$

The constant log(a) can be ignored. As an alternative to separately subtracting the term log(w(k)), new extended vectors $\vec{p}$ and $\vec{q}(k)$ may be introduced, defined by:

$$\vec{p}^{\,T} = (b^{1/r}\,\vec{o}^{\,T},\; 0), \qquad \vec{q}(k)^T = (b^{1/r}\,\vec{\mu}(k)^T,\; (-\log(w(k)))^{1/r})$$

In this formula it should be noted that -log(w(k)) > 0. Using the extended vectors gives:

$$\log(p(\vec{o})) \approx -\min\{\, \|\vec{p}-\vec{q}(k)\|_r^r \mid k = 1,\ldots,N \,\} + \log(a)$$

Since $\vec{p}$ and $\vec{q}(k)$ have one more component, their dimension is D+1, where D is the dimension of the feature vector space.
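The equivalence between the weighted distance and the distance of the extended vectors can be checked numerically. The following is a minimal sketch with hypothetical helper names and arbitrarily chosen example values for the observation, the mean, the weight w, the coefficient b and the norm order r.

```python
import math

def lr_distance_pow(u, v, r):
    """Return ||u - v||_r^r, the r-th power of the L_r distance."""
    return sum(abs(a - b) ** r for a, b in zip(u, v))

def extend_observation(o, b, r):
    """Extended vector p: the observation scaled by b^(1/r), with a zero appended."""
    return [b ** (1.0 / r) * value for value in o] + [0.0]

def extend_reference(mu, weight, b, r):
    """Extended vector q(k): the mean scaled by b^(1/r), with (-log w)^(1/r) appended."""
    return [b ** (1.0 / r) * value for value in mu] + [(-math.log(weight)) ** (1.0 / r)]

o, mu, weight, b, r = [1.0, 2.0], [0.5, 4.0], 0.25, 0.5, 2
direct = b * lr_distance_pow(o, mu, r) - math.log(weight)
extended = lr_distance_pow(extend_observation(o, b, r),
                           extend_reference(mu, weight, b, r), r)
```

Both `direct` and `extended` evaluate to the same value, so the mixture weight can be folded into the reference vector once, in advance.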

In the remainder of the document reference is made to the vectors $\vec{x}$ and $\vec{y}$, where $\vec{x} = \vec{o}$ and $\vec{y} = \vec{\mu}$. It will be appreciated that the same concepts, as described next, can be applied to the vectors $\vec{p}$ and $\vec{q}$ by reading $\vec{x} = \vec{p}$ and $\vec{y} = \vec{q}$.

Conventional pattern recognition involves calculating vector similarity scores, such as the maximum likelihood or, as part of that, the distance, between the input observation vector $\vec{o}$ and each reference vector. As described earlier, the maximum likelihood calculation for large vocabulary systems may involve over 10,000 distance calculations between, for instance, 32-dimensional vectors. These calculations are repeated for each observation. Instead of calculating all distances in full, the localizer 50 of figure 3 first selects a subset $\{\vec{\mu}_s\}$ of reference vectors from the total set $\{\vec{\mu}_a\}$ of reference vectors. The localizer 50 then only calculates the full distance for reference vectors of the subset $\{\vec{\mu}_s\}$. According to the invention the localizer 50 determines the subset $\{\vec{\mu}_s\}$ by calculating a measure of dissimilarity between the input observation vector $\vec{o}$ and each reference vector of the total set $\{\vec{\mu}_a\}$. Then, a number of reference vectors for which the smallest measure of dissimilarity was calculated are used as the subset. The number may be chosen as a fixed number or a percentage of the total number of reference vectors. Alternatively, the localizer 50 selects reference vectors $\vec{\mu}_a$ whose distances $d(R(\vec{o}), R(\vec{\mu}_a))$ are below a predetermined threshold T. Advantageously, tests are performed to determine an optimum number or threshold, which is sufficiently low to provide a considerable performance gain but does not too severely affect the accuracy of recognition.

The measure of dissimilarity is used for obtaining a ranking of the reference vectors with regard to their dissimilarity ('distance') compared to the observation vector. The ranking does not need to be fully accurate, as long as with a high likelihood the subset contains the reference vector which is near the observation vector, using the actual distance measure (likelihood calculation) which is used for the pattern recognition. This makes it possible to put fewer constraints on the measure of dissimilarity than are put on the actual full distance measure. Advantageously, a measure of dissimilarity is chosen which allows for a more efficient calculation of an approximate distance than the full distance calculation.
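The two-stage search described above can be sketched as follows. This is only an illustration: the cheap measure (a count of differing signs), the full measure (the L1 distance), the reference values and all function names are assumptions chosen for the example.

```python
def sign_mismatches(u, v):
    """Cheap measure of dissimilarity: number of components whose signs differ."""
    return sum((a < 0) != (b < 0) for a, b in zip(u, v))

def l1_distance(u, v):
    """Full distance: L1-norm of the difference vector."""
    return sum(abs(a - b) for a, b in zip(u, v))

def select_subset(observation, references, keep):
    """Indices of the `keep` references with the smallest cheap dissimilarity."""
    ranked = sorted(range(len(references)),
                    key=lambda i: sign_mismatches(observation, references[i]))
    return ranked[:keep]

def nearest_in_subset(observation, references, subset):
    """Index of the reference in the subset with the smallest full distance."""
    return min(subset, key=lambda i: l1_distance(observation, references[i]))

x = [51, -72, 46]
refs = [[40, -60, 50], [-10, 80, -20], [60, -70, -5], [55, -75, 45]]
subset = select_subset(x, refs, keep=2)
best = nearest_in_subset(x, refs, subset)
```

Here the full distance is evaluated for only two of the four reference vectors; a threshold on the cheap measure could be used instead of a fixed `keep`.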

In a further embodiment according to the invention the localizer 50 determines the subset $\{\vec{\mu}_s\}$ by:
- Quantising each reference vector $\vec{\mu}_a$ to a quantised vector $R(\vec{\mu}_a)$. This can already be done during training, when the reference vectors are determined. In this case, the quantised vector $R(\vec{\mu}_a)$ can, advantageously, be stored in the reference pattern database 40 before the actual speech recognition starts. This improves the on-line performance of the system.
- For each input observation vector $\vec{o}$:
  - quantising the input observation vector $\vec{o}$ to a quantised vector $R(\vec{o})$,
  - calculating distances $d(R(\vec{o}), R(\vec{\mu}_a))$ of the quantised vector $R(\vec{o})$ to each quantised vector $R(\vec{\mu}_a)$, and
  - using as the subset $\{\vec{\mu}_s\}$ a number of reference vectors $\vec{\mu}_a$ from the total set $\{\vec{\mu}_a\}$, whose distances $d(R(\vec{o}), R(\vec{\mu}_a))$ are the smallest. As such, the distance $d(R(\vec{o}), R(\vec{\mu}_a))$ is used as the measure of dissimilarity between the observation vector $\vec{o}$ and the reference vector $\vec{\mu}_a$.

The quantisation of a vector reduces the complexity of a vector. In the following embodiments various forms of quantisation are described, resulting in a simplified distance calculation between the quantised vectors. One approach is to reduce the complexity of the vector components, for instance by using as a quantised vector a vector which is proportional to a vector with binary components. Another approach is to use a quantised vector with fewer vector components (lower dimension) of similar complexity as the components of the original vector. Both approaches may also be combined, by using as a quantised vector a vector with some components of reduced complexity and the remaining components of similar complexity as the original vector.

In a further embodiment, a sign vector $\vec{S}(\vec{z})$ is used as the quantised vector of a vector $\vec{z}$. The same quantisation is used for both $\vec{x}$ and $\vec{y}$. The components of $\vec{S}(\vec{z})$ have binary values representing the sign of the corresponding component of $\vec{z}$:

$$\vec{z} \to \vec{S}(\vec{z}), \quad \text{where} \quad \vec{S}(\vec{z}) = (\mathrm{sign}(z_1),\ldots,\mathrm{sign}(z_D))^T \quad \text{and} \quad \mathrm{sign}(z_i) = \begin{cases} b1, & \text{if } z_i < 0 \\ b2, & \text{if } z_i > 0 \end{cases} \quad \text{where } i = 1,\ldots,D$$

For components of $\vec{S}(\vec{z})$ where the corresponding component of $\vec{z}$ has the value zero, either b1 or b2 may be selected. Preferably, the same choice is made for all quantised vectors. Alternatively, a third discrete value is used to represent zero value components. The definition has been given for a system in which the original vector components are represented using signed values, such as signed 2-byte integers. It will be understood how the same concept can be applied to unsigned values, by assigning the lower half values (0..32767) to b1 and the upper half values (32768..65535) to b2.

A good illustration of the method is achieved by using -1 for b1 and +1 for b2:

$$\vec{z} \to \vec{S}(\vec{z}), \quad \text{where} \quad \vec{S}(\vec{z}) = (\mathrm{sign}(z_1),\ldots,\mathrm{sign}(z_D))^T \quad \text{and} \quad \mathrm{sign}(z_i) = \begin{cases} -1, & \text{if } z_i < 0 \\ +1, & \text{if } z_i \geq 0 \end{cases} \quad \text{where } i = 1,\ldots,D$$

In this case the zero components are also represented by +1. Geometrically, this can be seen as projecting each vector onto a 'unity' vector in the middle of the same space sector as the vector itself. Figure 4 illustrates a 2-dimensional example in which the vector (3,1)^T is projected onto the vector (1,1)^T. The same applies for all vectors in the space sector (quadrant) I. Similarly, vectors in the space sector II are projected onto (-1,1)^T; vectors in the space sector III are projected onto (-1,-1)^T; vectors in the space sector IV are projected onto (1,-1)^T. As an example, in a 3-dimensional feature vector space the vector $\vec{x}$ = (51, -72, 46)^T is quantised as follows:

$$\vec{x} = (51, -72, 46)^T \to (1, -1, 1)^T = \vec{S}(\vec{x})$$

The distance $d_r(\vec{x}, \vec{y})$ between the vectors $\vec{x}$ and $\vec{y}$ is replaced by the distance $d_r(\vec{S}(\vec{x}), \vec{S}(\vec{y}))$ between the respective sign vectors:

$$d_r(\vec{x}, \vec{y}) \to d_r(\vec{S}(\vec{x}), \vec{S}(\vec{y})) = \|\vec{S}(\vec{x}) - \vec{S}(\vec{y})\|_r$$
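A minimal sketch of the sign quantisation with -1 and +1 follows, using the example vector from the text. The function names and the second vector are hypothetical.

```python
def sign_vector(z):
    """Quantise each component to -1 (negative) or +1 (zero or positive)."""
    return [-1 if value < 0 else 1 for value in z]

def l1_distance(u, v):
    """L1 distance between two vectors of equal dimension."""
    return sum(abs(a - b) for a, b in zip(u, v))

s_x = sign_vector([51, -72, 46])
s_y = sign_vector([-1, -5, 9])   # hypothetical second vector
approx = l1_distance(s_x, s_y)
```

Each component in which the signs differ contributes 2 to the L1 distance of the sign vectors, so the approximate distance only takes the values 0, 2, ..., 2D.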

Figure 5 illustrates this method for the above given vector $\vec{x}$ and 10 extended 3-dimensional vectors $\vec{y}_0 \ldots \vec{y}_9$, representing 10 mean vectors (prototypes). In the example, the vector elements are randomly chosen integers in the range -99 to +99. It should be noted that in speech recognition systems, the vector components will typically be continuous values (mathematically: elements of ℝ), usually represented as 2- or 4-byte integers or floating-point numbers on a computer. The successive columns of the figure show the index i = 0,...,9 of the ten vectors $\vec{y}_i$, the vector $\vec{y}_i$, the difference between $\vec{x}$ and $\vec{y}_i$, the distance $d_1(\vec{x}, \vec{y}_i)$ between the vectors, the sign vector of $\vec{y}_i$, the difference between the sign vectors $\vec{S}(\vec{x})$ and $\vec{S}(\vec{y}_i)$, and the distance $d_1(\vec{S}(\vec{x}), \vec{S}(\vec{y}_i))$ between the sign vectors. Using the full distance calculation (L1-norm), the order in increasing distance is: $\vec{y}_8$, $\vec{y}_7$, $\vec{y}_1$, $\vec{y}_0$, $\vec{y}_9$, $\vec{y}_4$, [illegible], $\vec{y}_5$, [illegible] and $\vec{y}_2$. Using the approximated distance, the following groups were established (in increasing distance): $\vec{y}_7$ and $\vec{y}_8$, followed by $\vec{y}_0$, $\vec{y}_1$ and $\vec{y}_9$, followed by $\vec{y}_2$, $\vec{y}_3$, $\vec{y}_4$, $\vec{y}_5$ and $\vec{y}_6$. Already with only three dimensions the vectors are sorted relatively well.

It will be appreciated that in practice good results are achieved if on average half of the sign vector components have the value 1 (i.e. the sign vectors are well distributed over the space and are not clustered too much in one space sector). Advantageously, this is achieved by performing the approximation separately for each reference unit (mixture). For each reference unit a mixture mean vector is calculated and subtracted from all reference vectors which represent the reference unit. Next the reference sign vectors are calculated. Before calculating the sign vector of the observation vector, the mixture mean vector of the reference unit is subtracted from the observation vector. Next, the observation sign vector is calculated and compared to all reference sign vectors for the reference unit. This operation is performed for each reference unit separately. It can be seen as moving the coordinate system for each reference unit (all reference vectors representing it and the observation vector). This method has been tested in the Philips research system for large vocabulary continuous speech recognition. Using 32 reference vectors for each reference unit, experiments showed that it was sufficient to select a subset of approximately 20% of the reference vectors to achieve an accurate distance calculation. This reduced the required run-time for the distance calculation by 50%.
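The per-reference-unit centring can be sketched as below. The unit contains four hypothetical reference vectors that all lie in the same quadrant, so that uncentred sign vectors would be identical; after subtracting the mixture mean they become distinguishable. All names and values are illustrative assumptions.

```python
def mean_vector(vectors):
    """Component-wise mean of the reference vectors of one reference unit."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def centred_sign_bits(vector, centre):
    """Sign bits (1 for zero or positive, 0 for negative) after subtracting `centre`."""
    return [1 if value - mean >= 0 else 0 for value, mean in zip(vector, centre)]

def rank_unit(observation, unit_vectors, keep):
    """Indices of the `keep` reference vectors of one unit with fewest sign mismatches."""
    centre = mean_vector(unit_vectors)
    observation_bits = centred_sign_bits(observation, centre)
    mismatches = [
        sum(a != b for a, b in zip(observation_bits, centred_sign_bits(v, centre)))
        for v in unit_vectors
    ]
    return sorted(range(len(unit_vectors)), key=lambda i: mismatches[i])[:keep]

unit = [[10, 10], [12, 10], [10, 14], [12, 14]]
selected = rank_unit([11.5, 13], unit, keep=2)
```

In a real system the centred reference sign vectors would be computed once during training rather than for every observation, as the text suggests for quantised reference vectors in general.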

In a further embodiment, the distance $d(\vec{S}(\vec{x}), \vec{S}(\vec{y}))$ of the vectors $\vec{S}(\vec{x})$ and $\vec{S}(\vec{y})$ is calculated by calculating a Hamming distance $H(\vec{S}(\vec{x}), \vec{S}(\vec{y}))$ of the vectors $\vec{S}(\vec{x})$ and $\vec{S}(\vec{y})$. The Hamming distance can be calculated very fast. Preferably, the sign vectors are represented by vectors with only 0 and 1 elements. For instance, the 8-dimensional sign vector (1, -1, -1, -1, 1, 1, -1, 1)^T can be represented by (1, 0, 0, 0, 1, 1, 0, 1)^T. Advantageously, these vectors can be stored as a sequence of bits in a computer memory. Preferably, the vectors are aligned according to the preferred alignment of the computer, such as a byte, a 16-bit or a 32-bit word. As a next step in calculating the Hamming distance, the XOR function over the entire two vectors is calculated, providing a difference vector. Each component of the difference vector contains the binary XOR value of the corresponding components of $\vec{S}(\vec{x})$ and $\vec{S}(\vec{y})$. Most micro-processors allow an XOR function to be calculated over an entire computer word in one operation. For instance, if $\vec{S}(\vec{x})$ is represented by (1, 0, 0, 0, 1, 1, 0, 1)^T and $\vec{S}(\vec{y})$ is represented by (1, 0, 1, 0, 0, 0, 1, 1)^T, this gives the difference vector (0, 0, 1, 0, 1, 1, 1, 0)^T. In principle, the Hamming distance can now be calculated by counting the number of 1-elements in the difference vector. In this example, the Hamming distance is four. If preferred, the Hamming distance may be multiplied by two to give the same distance as used in figure 5.
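The XOR-and-count step can be sketched with the two 8-bit sign vectors of the example. Packing the bits into a single integer stands in for the computer word; the helper names are hypothetical.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 elements into one integer, first element most significant."""
    word = 0
    for bit in bits:
        word = (word << 1) | bit
    return word

def hamming(a, b):
    """Hamming distance of two packed sign vectors: XOR, then count the 1-bits."""
    return bin(a ^ b).count("1")

s_x = pack_bits([1, 0, 0, 0, 1, 1, 0, 1])
s_y = pack_bits([1, 0, 1, 0, 0, 0, 1, 1])
difference = s_x ^ s_y
distance = hamming(s_x, s_y)
```

The difference word is 2E (hex) and the Hamming distance is four, as in the text; multiplying by two gives the L1 distance of the corresponding -1/+1 sign vectors.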

In a further embodiment, the Hamming distance for each difference vector is calculated in advance and stored in the form of a table in memory 60 of figure 3. As an example, for the bit sequence (0, 0, 1, 0, 1, 1, 1, 0) = 2E (hex) the Hamming distance four can be stored at the 46-th (= 2E (hex)) entry in the table. If the dimension of the vector space gets too large (e.g. more than 16), as an alternative to storing the full table, the vector can advantageously be broken into smaller units of, for instance, 16 bits. For each unit the table is used to determine the Hamming distance of the unit. The complete Hamming distance is obtained by summing the Hamming distances of each separate unit.
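A table-driven variant might look like the following sketch. For brevity it assumes 8-bit units (a 256-entry table) rather than the 16-bit units mentioned in the text; the constant and function names are hypothetical.

```python
UNIT_BITS = 8  # assumption for the example; the text suggests units of e.g. 16 bits

# Precomputed Hamming distance (number of 1-bits) for every possible unit value.
TABLE = [bin(value).count("1") for value in range(1 << UNIT_BITS)]

def hamming_by_table(a, b, total_bits):
    """Sum the table entries of each unit of the XOR difference word."""
    difference = a ^ b
    mask = (1 << UNIT_BITS) - 1
    distance = 0
    for shift in range(0, total_bits, UNIT_BITS):
        distance += TABLE[(difference >> shift) & mask]
    return distance

example = hamming_by_table(0xFFFF0000, 0x0F0F0F0F, 32)
```

The entry at index 2E (hex) of the table holds the value four, matching the example in the text.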

It will be appreciated that a further improvement in performance is achieved by reducing the dimension of the sign vector, i.e. by dropping some of the components. As an example, only half of the components are used (e.g. the first D/2 components) and the resulting estimated distance is multiplied by two to compensate for this. Advantageously, the number of components is reduced to a preferred size for processing. For instance, for many micro-processors it will be advantageous to reduce a 35-dimensional vector to a 32-dimensional vector with 1-bit components, forming a 32-bit computer word.

It will be understood that, instead of quantising a vector component by assigning it to one out of two regions (binary representation), more than two regions may also be used. As an example, three regions may be used, as illustrated in figure 6 for one component of the vector (the x1 component). The components of the quantised vector $R(\vec{z})$ have 2-bit values representing the corresponding component of $\vec{z}$:

$$\vec{z} \to R(\vec{z}), \quad \text{where} \quad R(\vec{z}) = (r(z_1),\ldots,r(z_D))^T \quad \text{and} \quad r(z_i) = \begin{cases} 00, & \text{if } z_i < a_i \\ 01, & \text{if } a_i \leq z_i < b_i \\ 11, & \text{if } z_i \geq b_i \end{cases} \quad \text{where } i = 1,\ldots,D$$

As illustrated in figure 6, similarly as before, the binary XOR operation and the Hamming distance calculation can be used to calculate a distance between the quantised vectors and to give an indication of the distance of the actual vectors, allowing a subset to be determined.
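A three-region quantisation could be sketched as follows. The 2-bit codes 00, 01 and 11 are an assumption chosen so that the Hamming distance between two codes equals the number of region boundaries between them; the thresholds and names are likewise hypothetical.

```python
def quantise_component(value, low, high):
    """Map a component to a 2-bit code for one of three regions (assumed codes)."""
    if value < low:
        return 0b00
    if value < high:
        return 0b01
    return 0b11

def quantise(vector, lows, highs):
    """Pack the 2-bit codes of all components into one integer."""
    word = 0
    for value, low, high in zip(vector, lows, highs):
        word = (word << 2) | quantise_component(value, low, high)
    return word

def hamming(a, b):
    """Hamming distance of two packed quantised vectors."""
    return bin(a ^ b).count("1")

lows, highs = [-10, -10, -10], [10, 10, 10]
r_x = quantise([-50, 0, 50], lows, highs)
r_y = quantise([50, 0, -50], lows, highs)
distance = hamming(r_x, r_y)
```

With these codes, a component that moves from the lowest to the highest region contributes 2 to the Hamming distance, and a move to a neighbouring region contributes 1.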

In an alternative embodiment, in addition to the sign vector the norm of the vector is also taken into account, providing a more accurate approximation. The approximation is based on the insight that by writing a vector $\vec{z}$ as:

$$\vec{z} = \vec{S}(\vec{z}) \cdot \frac{\|\vec{z}\|_r}{D^{1/r}} + \Delta\vec{z}$$

the first term is a good approximation for $\vec{z}$. Geometrically, this can be seen as projecting each vector onto the diagonal in the same space sector as the vector itself by rotating the vector. Figure 7 illustrates a 2-dimensional example, using the L2-norm, in which the vector (3,1)^T is projected onto the vector (√5, √5)^T.

For the 3-dimensional example vector $\vec{x}$ = (51, -72, 46)^T and using the L1-norm, this gives:

$$\vec{x} = (51, -72, 46)^T = (1, -1, 1)^T \cdot 56\tfrac{1}{3} + \Delta\vec{x} = (56\tfrac{1}{3}, -56\tfrac{1}{3}, 56\tfrac{1}{3})^T + \Delta\vec{x}, \quad \text{where } \Delta\vec{x} = (-5\tfrac{1}{3}, -15\tfrac{2}{3}, -10\tfrac{1}{3})^T$$

For the distance between $\vec{x}$ and $\vec{y}$, this gives:

$$\|\vec{x} - \vec{y}\|_r = d_{xy} + \epsilon, \quad \text{where} \quad d_{xy} = \left\| \vec{S}(\vec{x}) \frac{\|\vec{x}\|_r}{D^{1/r}} - \vec{S}(\vec{y}) \frac{\|\vec{y}\|_r}{D^{1/r}} \right\|_r$$

The triangle inequality gives:

$$\epsilon \leq \|\Delta\vec{x}\|_r + \|\Delta\vec{y}\|_r$$

Ignoring the error ε, the distance $d(\vec{x}, \vec{y})$ between $\vec{x}$ and $\vec{y}$ is approximated by $d_{xy}$. Figure 8 illustrates this method for the above given vector $\vec{x}$ and 10 extended 3-dimensional vectors $\vec{y}_0 \ldots \vec{y}_9$, representing 10 mean vectors (prototypes). The last two columns of the figure show the approximated distance $d_{xy}$ and the full distance between the vectors $\vec{x}$ and $\vec{y}_i$, using the L1-norm. Using the full distance calculation, the order in increasing distance is: $\vec{y}_8$, $\vec{y}_7$, $\vec{y}_1$, $\vec{y}_0$, $\vec{y}_9$, $\vec{y}_4$, [illegible], $\vec{y}_5$, [illegible] and $\vec{y}_2$. Using the approximated distance, the order in increasing distance is: $\vec{y}_8$, $\vec{y}_7$, $\vec{y}_0$, $\vec{y}_1$, [illegible], [illegible], $\vec{y}_4$, [illegible], [illegible] and $\vec{y}_3$. Already with only three dimensions of the vectors a good sorting of the vectors has been achieved.
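The decomposition into a norm-scaled sign vector and a residual can be reproduced numerically for the two examples in the text. This is an illustrative sketch; the function names are hypothetical.

```python
import math

def lr_norm(z, r):
    """L_r-norm of a vector."""
    return sum(abs(value) ** r for value in z) ** (1.0 / r)

def sign_norm_approximation(z, r):
    """First term of the decomposition: S(z) * ||z||_r / D^(1/r)."""
    scale = lr_norm(z, r) / len(z) ** (1.0 / r)
    return [scale if value >= 0 else -scale for value in z]

x = [51, -72, 46]
approximation = sign_norm_approximation(x, 1)
residual = [value - approx for value, approx in zip(x, approximation)]
projected = sign_norm_approximation([3, 1], 2)
```

For the L1-norm example the approximation is (56 1/3, -56 1/3, 56 1/3) with residual (-5 1/3, -15 2/3, -10 1/3); for the L2-norm example, (3, 1) is mapped to (√5, √5).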

This method has been tested in the Philips research system for large vocabulary continuous speech recognition. Using 32 observation density components (clusters), the following results have been achieved:

Method   Number of calculated distances   Run time (relative to full)   Word errors [%]
Full     32                               1.00                          24.52
HVQ-1    23.9                             0.75                          24.52
HVQ-2    21.3                             0.69                          24.52
HVQ-3    18.6                             0.62                          24.52
HVQ-4    15.9                             0.53                          25.07
HVQ-5    13.3                             0.48                          25.07
HVQ-6    10.6                             0.42                          25.07
HVQ-7    7.9                              0.34                          25.07
HVQ-8    5.3                              0.29                          25.34

In the table, the first row of results shows the results of the conventional method of calculating all distances in full. The following eight rows show the results using the described method (referred to as HVQ), each time calculating a different number/percentage of distances in full. As shown in the table, by only calculating approx. 20% of the distances in full a word error rate is achieved which is still close to that of calculating all distances fully, while requiring only approx. 30% of the computing time for the nearest neighbour search. Since the nearest neighbour search may require 50-75% of the total processing time, this reduces the total processing time by approx. 35-50%.

To achieve these results, further optimisations to calculate $d_{xy}$ have been used as described for the further embodiments. It will be appreciated that for an approximation of the vector $\vec{z}$ the following can also be used:

$$\vec{z} \approx \vec{S}(\vec{z}) \cdot \|\vec{z}\|_r$$

Although this approximation is less accurate than the previous approximation, the difference is a multiplication by a constant $1/D^{1/r}$, which does not affect the ranking of the vectors.

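Using the definition of the L_r-norm, the approximated distance $d_{xy}$ can be written as:

$$d_{xy} = \left( \sum_{i=1}^{D} \left| \vec{S}(\vec{x})_i \frac{\|\vec{x}\|_r}{D^{1/r}} - \vec{S}(\vec{y})_i \frac{\|\vec{y}\|_r}{D^{1/r}} \right|^r \right)^{1/r}$$

Considering that:

$$\left| \vec{S}(\vec{x})_i \|\vec{x}\|_r - \vec{S}(\vec{y})_i \|\vec{y}\|_r \right| = \begin{cases} \left| \|\vec{x}\|_r - \|\vec{y}\|_r \right|, & \text{if } \vec{S}(\vec{x})_i = \vec{S}(\vec{y})_i \\ \left| \|\vec{x}\|_r + \|\vec{y}\|_r \right|, & \text{if } \vec{S}(\vec{x})_i = -\vec{S}(\vec{y})_i \end{cases} \quad \text{where } |\vec{S}(\vec{x})_i| = 1$$

this gives:

$$(d_{xy})^r = \frac{1}{D} \left( \sum_{i:\, \vec{S}(\vec{x})_i = \vec{S}(\vec{y})_i} \left| \|\vec{x}\|_r - \|\vec{y}\|_r \right|^r + \sum_{i:\, \vec{S}(\vec{x})_i \neq \vec{S}(\vec{y})_i} \left| \|\vec{x}\|_r + \|\vec{y}\|_r \right|^r \right)$$

Defining:

$$q_{xy} = \frac{\left| \{\, i : \vec{S}(\vec{x})_i = \vec{S}(\vec{y})_i \,\} \right|}{D}$$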

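Since the dimension of the vectors is D, and according to the given definition $D \cdot q_{xy}$ elements have the same sign, this implies that $D - D \cdot q_{xy} = D \cdot (1 - q_{xy})$ elements have a different sign. This gives:

$$\text{General:} \quad (d_{xy})^r = q_{xy} \left| \|\vec{x}\|_r - \|\vec{y}\|_r \right|^r + (1 - q_{xy}) \left| \|\vec{x}\|_r + \|\vec{y}\|_r \right|^r$$

For Laplacian densities (L1-norm) this gives the following approximation for the distance:

$$\text{Laplacian:} \quad \|\vec{x} - \vec{y}\|_1 \approx \|\vec{x}\|_1 + \|\vec{y}\|_1 - 2\, q_{xy} \min(\|\vec{x}\|_1, \|\vec{y}\|_1)$$

For Gaussian densities (L2-norm) this gives the following approximation for the distance:

$$\text{Gaussian:} \quad \|\vec{x} - \vec{y}\|_2^2 \approx \|\vec{x}\|_2^2 + \|\vec{y}\|_2^2 - 2\,(2 q_{xy} - 1)\, \|\vec{x}\|_2 \|\vec{y}\|_2$$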

Consequently, by only using $q_{xy}$ and the norms of the vectors $\vec{x}$ and $\vec{y}$ the distance can be estimated. It will be appreciated that the norm of the vectors $\vec{y}$ can be calculated in advance and, advantageously, stored in the reference pattern database 40 of figure 3. Furthermore, the norm of the vector $\vec{x}$ can be calculated once and used for calculating the distance to each vector $\vec{y}$. For calculating $q_{xy}$, it should be noted that $D - D \cdot q_{xy}$ is the Hamming distance of the sign vectors of $\vec{x}$ and $\vec{y}$. By defining $h_{xy}$ as the Hamming distance of the sign vectors of $\vec{x}$ and $\vec{y}$, the previous three formulas can be written as:

$$\text{General:} \quad (d_{xy})^r = \left(1 - \frac{h_{xy}}{D}\right) \left| \|\vec{x}\|_r - \|\vec{y}\|_r \right|^r + \frac{h_{xy}}{D} \left| \|\vec{x}\|_r + \|\vec{y}\|_r \right|^r$$

and

$$\text{Laplacian:} \quad \|\vec{x} - \vec{y}\|_1 \approx \|\vec{x}\|_1 + \|\vec{y}\|_1 - 2 \left(1 - \frac{h_{xy}}{D}\right) \min(\|\vec{x}\|_1, \|\vec{y}\|_1)$$

and

$$\text{Gaussian:} \quad \|\vec{x} - \vec{y}\|_2^2 \approx \|\vec{x}\|_2^2 + \|\vec{y}\|_2^2 - 2 \left(1 - \frac{2 h_{xy}}{D}\right) \|\vec{x}\|_2 \|\vec{y}\|_2$$
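The Laplacian and Gaussian estimates can be written as two small functions that take only the precomputed norms, the Hamming distance of the sign vectors and the dimension D. This is a sketch under the formulas as reconstructed above; the function names and the numeric inputs are illustrative.

```python
import math

def estimate_l1(norm_x, norm_y, hamming, dim):
    """Laplacian estimate of ||x - y||_1 from L1-norms and sign-vector Hamming distance."""
    return norm_x + norm_y - 2.0 * (1.0 - hamming / dim) * min(norm_x, norm_y)

def estimate_l2_squared(norm_x, norm_y, hamming, dim):
    """Gaussian estimate of ||x - y||_2^2 from L2-norms and sign-vector Hamming distance."""
    return (norm_x ** 2 + norm_y ** 2
            - 2.0 * (1.0 - 2.0 * hamming / dim) * norm_x * norm_y)

# x = (51, -72, 46) has L1-norm 169; a hypothetical y = (40, 60, 50) has L1-norm 150
# and differs from x in the sign of one of the three components.
estimate = estimate_l1(169, 150, 1, 3)
```

For this pair the estimate is 119, against a full L1 distance of 147; the estimate is exact when the vectors are identical or exactly opposite.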

Consequently, by using the Hamming distance $h_{xy}$ and the norms of the vectors $\vec{x}$ and $\vec{y}$ the distance can be estimated. It will be appreciated that the accuracy can be improved by performing the operation on part vectors. The vectors $\vec{x}$ and $\vec{y}$ are broken into N part vectors and part norms are calculated. For the vector $\vec{x}$ this is defined as:

$$\vec{x} = (\vec{x}_1^{\,T}, \ldots, \vec{x}_N^{\,T})^T, \quad \text{with part norms } \|\vec{x}_i\|_r, \quad i = 1,\ldots,N$$

This gives for Laplacian densities:

$$\text{Laplacian:} \quad \|\vec{x} - \vec{y}\|_1 \approx \|\vec{x}\|_1 + \|\vec{y}\|_1 - \sum_{i=1}^{N} 2 \left(1 - \frac{h(\vec{x}_i, \vec{y}_i)}{D/N}\right) \min(\|\vec{x}_i\|_1, \|\vec{y}_i\|_1)$$

In this formula $h(\vec{x}_i, \vec{y}_i)$ is the Hamming distance of the part vectors $\vec{x}_i$ and $\vec{y}_i$. If the Hamming distance is calculated by using a table for vector units, as described earlier, advantageously the same size vector units are used for the table as are used for the part vectors.

In a further embodiment, after a subset of reference vectors has been selected for an observation vector, the same subset is used for a number N of successive observation vectors, with N ≥ 1. So, the same subset is used for N + 1 observations. For the observation which follows the N + 1 observations, a new subset of reference vectors is selected. This mechanism is repeated for the following observations. The number N may be selected by performing experiments. In the Philips research system for large vocabulary continuous speech recognition good results have been achieved by using N = 1 (two successive observation vectors using the same subset). As an alternative to using a predetermined number N, advantageously the localizer 50 of figure 3 dynamically decides whether to keep the subset which was determined for an earlier observation vector, or to determine a new subset for the current observation vector. Preferably, the localizer 50 makes the decision based on the dynamic behaviour of the observation vector $\vec{o}$. For instance, the localizer 50 can remember the observation vector $\vec{o}$ for which the subset was calculated. As long as subsequent observation vectors are near the memorised observation vector (i.e. the distance is below a predetermined threshold), the same subset is kept. A new subset is determined if a subsequent observation vector is further away from the memorised observation vector.
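The dynamic decision to keep or renew the subset can be sketched as a small cache around whatever subset selection is in use. The class name, the callback signature and the use of the L1 distance between observations are assumptions made for the illustration.

```python
def l1_distance(u, v):
    """L1 distance, used here to compare successive observation vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

class SubsetCache:
    """Keeps the last subset while new observations stay near the memorised one."""

    def __init__(self, select_subset, threshold):
        self._select_subset = select_subset  # callable: observation -> subset
        self._threshold = threshold
        self._memorised = None
        self._subset = None

    def subset_for(self, observation):
        too_far = (self._memorised is None or
                   l1_distance(observation, self._memorised) >= self._threshold)
        if too_far:
            # Determine a new subset and memorise the observation it belongs to.
            self._memorised = list(observation)
            self._subset = self._select_subset(observation)
        return self._subset

calls = []
def counting_select(observation):
    calls.append(list(observation))
    return [len(calls)]   # dummy subset that identifies the selection call

cache = SubsetCache(counting_select, threshold=5.0)
first = cache.subset_for([0, 0])
second = cache.subset_for([1, 1])      # near the memorised vector: subset is kept
third = cache.subset_for([10, 10])     # further away: a new subset is determined
```

A fixed-N policy would replace the distance test by a simple counter of observations since the last selection.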

In a further embodiment, the localizer ensures that for each reference unit at least one reference vector, which represents the reference unit, is a member of the subset of reference vectors. To this end, the localizer 50, after having selected an initial subset for an observation vector, verifies whether each reference unit is represented in the initial subset. If not, the localizer 50 adds for the reference units which were not represented a reference vector which is representative for the reference unit. For each reference unit, a representative reference vector may be chosen during training. As an example, the reference vector could be chosen which is nearest to the weighted average of all reference vectors representing a reference unit. If a representative reference vector is chosen in advance, this vector may advantageously be stored in the reference pattern database 40.

In a further embodiment, the representative reference vector is chosen dynamically. If in the initial subset for an observation vector a reference unit is not represented, then as a representative reference vector a reference vector is added which for the previous observation vector was found to best represent the reference unit, i.e. had the smallest distance to the previous observation vector. To this end, the localizer 50, after having selected an initial subset for an observation vector, checks whether each reference unit is represented in the subset. If not, the representative reference vectors for the unrepresented reference units are added. Next, the full distance of the observation vector to each reference vector in the subset is calculated, followed by selecting for each reference unit a reference vector which represents the reference unit and has the smallest distance to the observation vector. This reference vector is chosen as the representative reference vector and may be used as such if the subset for the next observation vector requires it. It should be noted that a representative reference vector may stay the same for a large number of consecutive observation vectors. Using the dynamic approach, for each reference unit an initial representative reference vector needs to be selected, to be used to complement the subset for the first observation vector, if required. In principle, any reference vector which represents a reference unit may be chosen as the initial representative reference vector. Advantageously, the reference vector which is nearest to the weighted average of all reference vectors representing a reference unit is chosen as the initial representative reference vector.
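Completing the subset and updating the representatives dynamically can be sketched as two functions. The data layout (a list `unit_of` mapping each reference vector index to its reference unit, and a dictionary of representatives per unit) is an assumption made for the example, as are the names and values.

```python
def l1_distance(u, v):
    """Full distance used in this example: L1-norm of the difference."""
    return sum(abs(a - b) for a, b in zip(u, v))

def complete_subset(subset, unit_of, representatives):
    """Add the representative vector of every reference unit missing from the subset."""
    present = {unit_of[index] for index in subset}
    completed = list(subset)
    for unit, representative in representatives.items():
        if unit not in present:
            completed.append(representative)
    return completed

def update_representatives(observation, references, subset, unit_of):
    """Per unit, pick the subset member with the smallest full distance."""
    best = {}
    for index in subset:
        distance = l1_distance(observation, references[index])
        unit = unit_of[index]
        if unit not in best or distance < best[unit][0]:
            best[unit] = (distance, index)
    return {unit: index for unit, (_, index) in best.items()}

references = [[0, 0], [1, 1], [10, 10], [12, 12]]
unit_of = [0, 0, 1, 1]                 # vectors 0,1 represent unit 0; vectors 2,3 unit 1
representatives = {0: 0, 1: 2}         # initial representatives, e.g. chosen in training
subset = complete_subset([1], unit_of, representatives)
representatives = update_representatives([11.6, 11.6], references, subset, unit_of)
```

After this step the updated dictionary is carried over to the next observation vector, where it is used if that subset leaves a unit unrepresented.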
The description focuses on determining the distance of one observation vector to reference vectors, being a key step in pattern recognition and the subject of the invention. It is well understood in the art how this key element can be used in combination with other techniques, such as Hidden Markov Models, to recognise a time-sequential pattern which is derived from a continual physical quantity. Using such techniques, for each observation vector a vector similarity score, such as a likelihood, between the observation vector and each reference vector of the subset is calculated. For each reference pattern the vector similarity scores of the reference vectors which represent the reference pattern are combined to form a pattern similarity score. This is repeated for successive observation vectors. The reference pattern for which an optimum, such as a maximum likelihood, is calculated for the pattern similarity score is located as the recognised pattern. Output means 70 are used for outputting the recognised pattern. This may take various forms, such as displaying the recognised pattern on screen in a textual format, storing the recognised pattern in memory or using the recognised pattern as input, such as a command, for a next processing operation. It is also well understood in the art how techniques, such as a levelled approach, can be used to recognise patterns which comprise a larger sequence of observation vectors than the reference patterns. For instance it is known how to use sub-word units as reference patterns to recognise entire words or sentences. It is also well understood how additional constraints, such as a pronunciation lexicon and grammar, may be placed on the pattern recognition. The additional information, such as the pronunciation lexicon, can be stored using the same memory as used for storing the reference pattern database.

Although the description highlights speech recognition, it will be 
understood that the invention can be applied to any type of pattern recognition in which the 
pattern has a time-sequential nature and is derived from a continual physical quantity. 

The invention can be summarised in mathematical terms in the following way. A finite number of K reference vectors $\vec{\mu}_1$ to $\vec{\mu}_K$ are given, together forming a set $P = \{\vec{\mu}_1, \ldots, \vec{\mu}_K\}$ of reference vectors. The components of the vectors usually represent a continuous value (element of ℝ), making $P \subset \mathbb{R}^D$, where D is the dimension of the reference vector. It should be noted that in practical implementations typically 2-byte integers are used instead of continuous values. The task is to find in an efficient way for each input observation vector $\vec{o} \in O$ (where O is the set of possible observation vectors, $O = \mathbb{R}^D$) a reference vector $\vec{\mu}_{opt}$ which is nearest to the observation vector $\vec{o}$:

$$\vec{\mu}_{opt} = \arg\min_{\vec{\mu} \in P} d(\vec{\mu}, \vec{o})$$

The computationally complex distance calculation can be reduced by reducing the set P to a subset of P, which with a high likelihood contains $\vec{\mu}_{opt}$, and by searching for $\vec{\mu}_{opt}$ in that smaller subset. The subset is chosen for each observation vector $\vec{o}$. A subset function s can be defined which gives for each observation vector $\vec{o}$ a subset $P_s$ of P:

$$s: O \to \wp(P), \quad \text{where } \wp(P) \text{ is the power set (set of all subsets) of set } P$$

This gives the reduced task of finding a reference vector $\vec{\mu}_s$:

$$\vec{\mu}_s = \arg\min_{\vec{\mu} \in P_s} d(\vec{\mu}, \vec{o}), \qquad P_s = s(\vec{o}), \qquad P_s \subset P$$

The subset $P_s$ may not contain $\vec{\mu}_{opt}$. The function s is designed in such a way that the likelihood of $P_s$ containing $\vec{\mu}_{opt}$ is high, making it likely that $\vec{\mu}_s = \vec{\mu}_{opt}$.

In the approach as is known from E. Bocchieri, "Vector quantization for the efficient computation of continuous density likelihoods", Proceedings of ICASSP, 1993, pp. 692-695, first the reference vectors are clustered into neighbourhoods. Each neighbourhood is represented by a codeword $\vec{v}_i$, i = 1..N. The set V of codewords $\vec{v}_i$ is substantially smaller than the set P. The codewords are derived from P. For each codeword $\vec{v}_i$ an associated subset $P_i$ of P is determined in advance. During recognition, an observation vector $\vec{o}$ is quantised to a codeword $\vec{v}_j \in V$ using conventional codeword quantisation. The predetermined subset $P_j$ associated with $\vec{v}_j$ is used as the subset for $\vec{o}$.

clustering: $P \to V = \{\vec{v}_i\}$, and $V \to \wp(P)$, with $P_i \in \wp(P)$
codeword quantising: $O \to V$
giving: $s: O \to V \to \wp(P)$

According to the invention, a quantisation function R is used to quantise the reference vectors $\vec{\mu}$ as well as the observation vector $\vec{o}$. Therefore, R induces a mapping $P \to P_R$ and $O \to O_R$. The distances $d(R(\vec{o}), R(\vec{\mu}))$ between the quantised observation vector $R(\vec{o})$ and each quantised reference vector $R(\vec{\mu})$ are calculated. Based on these distances, a subset of quantised reference vectors is selected. This can be described by a subset function $s_R$ in the $P_R$ space:

$$s_R(R(\vec{o})) = \{\, R(\vec{\mu}) \in P_R : d(R(\vec{o}), R(\vec{\mu})) < t(\vec{o}, P) \,\}$$

The threshold function $t(\vec{o}, P)$ provides a threshold. Alternatively, the selection may be based on other mechanisms, such as a predetermined number or a percentage. For each subset of quantised reference vectors, the corresponding subset of reference vectors can be determined; as such the quantisation also induces an inverse map $R^{\#}: \wp(P_R) \to \wp(P)$, such that:

$$R^{\#}(M) = \{\, \vec{\mu} \in P : R(\vec{\mu}) \in M \,\}$$

In practice, typically a fixed relationship is maintained between each reference vector and its quantised reference vector. This defines s as indicated in the following diagram:

$$\begin{array}{ccc} O & \xrightarrow{\;s\;} & \wp(P) \\ \downarrow R & & \uparrow R^{\#} \\ O_R & \xrightarrow{\;s_R\;} & \wp(P_R) \end{array}$$

In a formula s is given by:

$$s(\vec{o}) = \{\, \vec{\mu} \in P : d(R(\vec{o}), R(\vec{\mu})) < t(\vec{o}, P) \,\}$$

Since the quantisation of the reference vectors can be done in advance (e.g. as part of the training) and the quantisation of the observation vector only needs to be performed once, computational savings will in general be achieved if the calculation of the distance between the quantised vectors is simpler than the calculation between the original vectors.

One approach for achieving this is to use a quantisation function which reduces the complexity of the components of the vectors, by quantising the vector components to a smaller discrete set B, giving $R: \mathbb{R}^D \to B^D$. On some processors, for instance, savings can be made by reducing a vector component from a 2-byte value (representing a continuous value) to a 1-byte value. As has been shown, considerable savings can be achieved by reducing the vector components to binary values, for instance B = {-1, 1} (which can be represented as {0, 1}). An example of such a quantisation function is $R(\vec{x}) = (\mathrm{sign}(x_1), \ldots, \mathrm{sign}(x_D))^T$. Good results have also been achieved by using as the quantised vector a binary vector multiplied by a scalar, such as the norm of the original vector. This can be seen as a quantisation function $R(\vec{x}) = (\mathrm{sign}(x_1), \ldots, \mathrm{sign}(x_D), \|\vec{x}\|)^T$ inducing a mapping $R: \mathbb{R}^D \to B^D \times \mathbb{R}$, with a special distance measure for the D+1 dimensional vectors. As shown before, the use of part norms can increase the accuracy further. An example of this is using a quantisation function $R(\vec{x}) = (\mathrm{sign}(x_1), \ldots, \mathrm{sign}(x_D), \|\vec{x}_1\|, \|\vec{x}_2\|, \ldots, \|\vec{x}_F\|)^T$ inducing a mapping $R: \mathbb{R}^D \to B^D \times \mathbb{R}^F$, with a special distance measure for the D+F dimensional vectors.

Another approach for quantising is to reduce the dimension of the vector, keeping vector components of similar complexity, giving $R: \mathbb{R}^D \to \mathbb{R}^F$ or $R: \mathbb{R}^D \to B^F$, with F < D. Examples of this are using part norms or using a subset of the sign vector components.

Obviously, both approaches can also be combined, giving $R: \mathbb{R}^D \to B^E \times \mathbb{R}^F$, with E ≤ D, F ≤ D.

CLAIMS:



1. A method for recognising an input pattern which is derived from a
continual physical quantity; said method comprising:

accessing said physical quantity and therefrom generating a plurality of input
observation vectors, representing said input pattern;

locating among a plurality of reference patterns a recognised reference pattern,
which corresponds to said input pattern; at least one reference pattern being a sequence of
reference units; each reference unit being represented by at least one associated reference
vector in a set {μ_a} of reference vectors; said locating comprising selecting for each input
observation vector o a subset {μ_s} of reference vectors from said set {μ_a} and calculating
vector similarity scores between said input observation vector o and each reference vector μ_s
of said subset {μ_s},

characterised in that selecting a subset {μ_s} of reference vectors for each input observation
vector o comprises calculating a measure of dissimilarity between said input observation
vector o and each reference vector of said set {μ_a} and using as said subset {μ_s} of
reference vectors a number of reference vectors μ_a, whose measures of dissimilarity with
said input observation vector o are the smallest.

2. A method as claimed in claim 1, characterised:

in that said method comprises quantising each reference vector μ_a to a quantised
reference vector R(μ_a), and

in that selecting the subset of reference vectors comprises for each input
observation vector o the steps of:

quantising said input observation vector o to a quantised observation
vector R(o);

calculating for said quantised observation vector R(o) distances d(R(o),
R(μ_a)) to each quantised reference vector R(μ_a); and

using said distance d(R(o), R(μ_a)) as said measure of dissimilarity
between said input observation vector o and said reference vector μ_a.

3. A method as claimed in claim 2, characterised in that said quantising a
vector x (reference vector μ_a or observation vector o) to a quantised vector R(x) comprises
calculating a sign vector S(x) by assigning to each component of said sign vector a binary
value, with a first binary value b1 being assigned if the corresponding component of the
vector x has a negative value and a second binary value b2 being assigned if the
corresponding component of the vector x has a positive value.

4. A method as claimed in claim 3, characterised in that said quantising
further comprises calculating an L_r-norm of the vector x and multiplying said norm with said
sign vector S(x).

5. A method as claimed in claim 4, characterised in that said quantising
further comprises dividing said sign vector S(x) by the dimension of the vector x to the
power 1/r.

6. A method as claimed in claim 3, characterised in that calculating said
distance d(R(o), R(μ_a)) comprises calculating a Hamming distance H(S(o), S(μ_a)) of the
vectors S(o) and S(μ_a).

7. A method as claimed in claim 4 or 5, characterised in that calculating said
distances d(R(o), R(μ_a)) comprises:

calculating the L_r-norm ||μ_a||_r of each vector μ_a, and

for each input observation vector o:

calculating the L_r-norm ||o||_r of the vector o; and

calculating a Hamming distance H(S(o), S(μ_a)) of the vector S(o) to
each vector S(μ_a).

8. A method as claimed in claim 6 or 7, characterised in that calculating said
Hamming distance H(S(o), S(μ_a)) of the vectors S(o) and S(μ_a) comprises:

calculating a difference vector by assigning to each component of said
difference vector the binary XOR value of the corresponding components of S(o) and S(μ_a);

determining a difference number by calculating how many components in said
difference vector have the value one, and

using said difference number as the Hamming distance.

9. A method as claimed in claim 8, characterised:

in that said method comprises constructing a table specifying for each
N-dimensional vector, with components having a binary value of zero or one, a corresponding
number, indicating how many components have the value one; and

in that determining said difference number comprises locating said difference
vector in said table and using said corresponding number as the Hamming distance.

10. A method as claimed in any one of the preceding claims, characterised in
that said method is adapted to, after having selected a subset of reference vectors for a
predetermined input observation vector o, use the same subset for a number of subsequent
observation vectors.

11. A method as claimed in any one of the preceding claims, characterised in
that said method comprises:

after selecting said subset of reference vectors for an input observation vector
o, ensuring that each reference unit is represented by at least one reference vector in said
subset, by adding for each reference unit, which is not represented, a representative
reference vector to said subset.

12. A method as claimed in claim 11, characterised in that said method
comprises:

after ensuring that each reference unit is represented by at least one reference
vector in said subset, choosing for each reference unit said representative reference vector by
selecting as the representative reference vector the reference vector from the subset, which
represents said reference unit and has a smallest distance to said input observation vector o.

13. A system for recognising a time-sequential input pattern, which is derived
from a continual physical quantity; said system comprising:

input means for accessing said physical quantity and therefrom generating a
plurality of input observation vectors, representing said input pattern;

a reference pattern database for storing a plurality of reference patterns; at least
one reference pattern being a sequence of reference units; each reference unit being
represented by at least one associated reference vector μ_a in a set {μ_a} of reference vectors;

a localizer for locating among the reference patterns stored in said reference
pattern database a recognised reference pattern, which corresponds to said input pattern; said
locating comprising selecting for each input observation vector o a subset {μ_s} of reference
vectors from said set {μ_a} and calculating vector similarity scores between said input
observation vector o and each reference vector μ_s of said subset {μ_s}; and

output means for outputting said recognised pattern;

characterised in that selecting a subset {μ_s} of reference vectors for each input observation
vector o comprises calculating a measure of dissimilarity between said input observation
vector o and each reference vector of said set {μ_a} and using as said subset {μ_s} of
reference vectors a number of reference vectors μ_a, whose measures of dissimilarity with
said input observation vector o are the smallest.

14. A system as claimed in claim 13, characterised:

in that said reference pattern database further stores for each reference vector
μ_a a quantised reference vector R(μ_a); and

in that selecting the subset {μ_s} of reference vectors comprises for each input
observation vector o the steps of:

quantising said input observation vector o to a quantised observation
vector R(o);

calculating for said quantised observation vector R(o) distances d(R(o),
R(μ_a)) to each quantised reference vector R(μ_a); and

using said distance d(R(o), R(μ_a)) as said measure of dissimilarity
between said input observation vector o and said reference vector μ_a.

15. A system as claimed in claim 14, characterised:

in that for a vector x (reference vector μ_a or observation vector o) the
quantised vector R(x) is proportional to a sign vector S(x); each component of said sign
vector S(x) comprising a first binary value b1 if the corresponding component of the vector
x has a negative value and a second binary value b2 if the corresponding component of the
vector x has a positive value.

16. A system as claimed in claim 15, characterised in that said quantised
vector R(x) is proportional to an L_r-norm of the vector x.

17. A system as claimed in claim 15, characterised in that calculating said
distance d(R(o), R(μ_a)) comprises calculating a Hamming distance H(S(o), S(μ_a)) of the
vectors S(o) and S(μ_a).

18. A system as claimed in claim 16, characterised:

in that for each reference vector μ_a the reference pattern database comprises the
L_r-norm ||μ_a||_r of the reference vector μ_a; and

in that said localizer calculates said distances d(R(o), R(μ_a)) by, for each input
observation vector o:

calculating the L_r-norm ||o||_r of the vector o and a Hamming distance
H(S(o), S(μ_a)) of the vector S(o) to each vector S(μ_a), and

combining the L_r-norm ||o||_r and the Hamming distance H(S(o), S(μ_a))
with the L_r-norm ||μ_a||_r stored in said reference pattern database.

19. A system as claimed in claim 17 or 18, characterised in that calculating
said Hamming distance H(S(o), S(μ_a)) of the vectors S(o) and S(μ_a) comprises:

calculating a difference vector by assigning to each component of said
difference vector the binary XOR value of the corresponding components of S(o) and S(μ_a);

determining a difference number by calculating how many components in said
difference vector have the value one, and

using said difference number as the Hamming distance.

20. A system as claimed in claim 19, characterised:

in that said system comprises a memory for storing a table specifying for each
N-dimensional vector, with components having a binary value of zero or one, a
corresponding number, indicating how many components have the value one; and

in that determining said difference number comprises locating said difference
vector in said table and using said corresponding number as the Hamming distance.

21. A system as claimed in any one of the claims 13 to 20, characterised in
that said localizer, after having selected a first corresponding subset of reference vectors for
a predetermined input observation vector o, uses said first subset as said corresponding
subset of reference vectors for a number of successive observation vectors.

22. A system as claimed in any one of the claims 13 to 21, characterised in
that said localizer, after selecting said subset of reference vectors for an input observation
vector o, ensures that each reference unit is represented by at least one reference vector in
said subset, by adding for each reference unit, which is not represented, a representative
reference vector to said subset.

23. A system as claimed in claim 22, characterised in that said localizer, after
ensuring that each reference unit is represented by at least one reference vector in said
subset, chooses for each reference unit said representative reference vector by selecting as
the representative reference vector the reference vector from the subset, which represents
said reference unit and has a smallest distance to said input observation vector o.